Abstract

The era of Big Data has spawned unprecedented interest in developing hashing algorithms for efficient storage and fast nearest neighbor search. Most existing work learns hash functions that are numeric quantizations of feature values in a projected feature space. In this work, we propose a novel hash learning framework that encodes the rank orders of features, instead of their numeric values, in a number of optimal low-dimensional ranking subspaces. We formulate the ranking subspace learning problem as the optimization of a piecewise linear convex-concave function and present two versions of our algorithm: one with independent optimization of each hash bit and the other exploiting a sequential learning framework. Our work is a generalization of the Winner-Take-All (WTA) hash family and naturally enjoys all the numeric stability benefits of rank correlation measures while being optimized to achieve high precision at very short code lengths. We compare with several state-of-the-art hashing algorithms in both supervised and unsupervised domains, showing superior performance on a number of data sets.


1. Introduction 

Massive amounts of social multimedia data are being generated by billions of users every day. The advent of multimedia big data presents a number of challenges and opportunities for research and development of efficient storage, indexing and retrieval techniques. Hashing is recognized by many researchers as a promising solution to the above Big Data problem, and has thus attracted a significant amount of research in the past few years (Datar et al., 2004; Kulis & Darrell, 2009; Tschopp & Diggavi, 2009; Weiss et al., 2009). Most hashing algorithms encode high-dimensional data into binary codes by quantizing numeric projections (Norouzi & Fleet, 2011; Liu et al., 2011; 2014; 2012). In contrast, hashing schemes based on the ranking order of features (i.e. comparisons) are relatively under-researched and will be the focus of this paper.


Ranking-based hashing, such as Winner-Take-All (WTA) (Yagnik et al., 2011) and Min-wise Hashing (MinHash) (Broder et al., 2000), ranks a random permutation of the input features and uses the index of the maximal/minimal feature dimension to encode a compact representation of the input features. The benefit of ranking-based hashing lies in the fact that these algorithms are insensitive to the magnitude of features, and thus are more robust against many types of random noise common in real applications ranging from information retrieval (Salakhutdinov & Hinton, 2007) and image classification (Fan, 2013) to object recognition (Torralba et al., 2008). In addition, the magnitude independence also makes the resultant hash codes scale-invariant, which is critical for comparing and aligning features from heterogeneous spaces, e.g., revealing multi-modal correlations (Li et al., 2014).


Unfortunately, existing ranking-based hashing is data-agnostic. In other words, the obtained hash codes are not learned by exploring the intrinsic structure of the data distribution, which makes them suboptimal for encoding the input features with compact codes of minimal length. For example, WTA encodes the data with the indices of the maximum dimensions chosen from a number of random permutations of input features. Although WTA has achieved leading performance in many tasks (Dean et al., 2013; Yagnik et al., 2011; Li et al., 2014), it is constrained in the sense that it only ranks the existing features of the input data and is incapable of combining multiple features to generate new feature subspaces to rank. A direct consequence of this limitation is that this sort of ranking-based hashing usually needs a very large number of permutations and rankings to generate useful codes, especially with a high-dimensional input feature space (Yagnik et al., 2011).


To address this challenge, we abandon the ranking of random permutations of existing features used in ranking-based hashing algorithms. Instead, we propose to generate compact ranking-based hash codes by learning a set of new subspaces and ranking the projections of the features onto these subspaces. At each step, an input data point is encoded by the index of the maximal value among its projections onto these subspaces. The subspace projections are jointly optimized to generate the ranking indices that are most discriminative with respect to the metric structure and/or the data labels. A vector of codes is then generated iteratively to represent the input data, using the maximal indices over a sequence of sets of subspaces.


This method generalizes ranking-based hashing from restricted random permutations to encoding by ranking a set of arbitrary subspaces learned by mixing multiple original features. This greatly extends its flexibility so that much shorter codes can be generated to encode the input data, while retaining the benefits of noise insensitivity and scale invariance inherent in such algorithms.


In the remainder of this paper, we first review the related hashing algorithms in Section 2; then the rank subspace hash learning problem is formulated and solved in Section 3. In Section 4 we present an improved learning algorithm based on sequential learning. The experimental results are presented in Section 5 and the paper is concluded in Section 6.


2. Related Work


Since the focus of this paper is on data-dependent hashing, we limit our review to two categories of this line of research: unsupervised and supervised hashing. Interested readers can refer to (Bondugula, 2013) and (Wang et al., 2014) for a comprehensive review of this research area.

The most representative works on unsupervised hashing include Spectral Hashing (SH) (Weiss et al., 2009) and the kernelized variant of Locality Sensitive Hashing (KLSH) (Kulis & Grauman, 2009). In detail, SH learns linear projections through an eigenvalue decomposition so that the distances between pairs of similar training samples are minimized in the projected subspaces when these samples are binarized. Similarly, KLSH also makes use of an eigensystem solution, but it manipulates data items in the kernel space in an effort to generalize LSH to accommodate arbitrary kernel functions.


Binary Reconstructive Embedding (BRE) (Kulis & Darrell, 2009) explicitly minimizes the reconstruction error between the input space and Hamming space to preserve the metric structure of the input space, and has demonstrated improved performance over SH and LSH. Iterative Quantization (ITQ) (Gong & Lazebnik, 2011) iteratively learns uncorrelated hash bits to minimize the quantization error between the hash codes and the dimension-reduced data. (Wang et al., 2012) presents a sequential projection learning method that fits the eigenvector solution into a boosting framework and uses pseudo labels in learning each hash bit. More recently, Anchor Graph Hashing (AGH) (Liu et al., 2011) and Discrete Graph Hashing (DGH) (Liu et al., 2014) use anchor graphs to capture the neighborhood structure inherent in a given dataset and adopt a discrete optimization procedure to achieve nearly balanced and uncorrelated hash bits.

On the other hand, supervised hashing methods take advantage of data labels to learn data-dependent hash functions, and have been shown to produce more discriminative hash codes in many tasks. For example, (Salakhutdinov & Hinton, 2007) uses Restricted Boltzmann Machines (RBM) to learn nonlinear binary hash codes for document retrieval and demonstrates better precision and recall than other methods. Similar deep learning hash methods have also been applied to the task of image retrieval in very large databases (Torralba et al., 2008). However, deep learning methods typically need large data sets and long training times, and have been outperformed by other methods designed exclusively for learning hash functions. For instance, (Norouzi & Fleet, 2011) proposes Minimal Loss Hashing (MLH), which has a structural SVM-like formulation and minimizes the loss-adjusted upper bound of a hinge-like loss function defined on pairwise similarity labels. The resulting hash codes have been shown to give superior performance over the state of the art. This method is further extended to minimize loss functions defined with triplet similarity comparisons (Norouzi et al., 2012). Similarly, (Li et al., 2013) also learns hash functions based on triplet similarity. In contrast, its formulation is a convex optimization within the large-margin learning framework rather than a structural SVM.


Recently, (Fan, 2013) theoretically proved the convergence properties of arbitrary sequential learning algorithms and proposed the Jensen Shannon Divergence (JSD) sequential learning method with a multi-class classification formulation. Supervised Hashing with Kernels (KSH) is another sequential learning algorithm. This method maps the data to compact hash codes by minimizing the Hamming distances of similar pairs and maximizing those of dissimilar pairs simultaneously (Liu et al., 2012). The sequential part of our algorithm is similar to Boosting Similarity Sensitive Coding (BSSC) (Shakhnarovich et al., 2003) and Forgiving Hash (FH) (Baluja & Covell, 2008), both of which treat each hash bit as a weak classifier and learn a series of hash functions in an AdaBoost framework. However, the rank-based hash function we learn at each step is significantly different from those in BSSC and FH, resulting in completely different objective functions and learning steps. Existing hashing schemes based on rank orders (e.g. (Tschopp & Diggavi, 2009), (Pele & Werman, 2008) and (Ozuysal et al., 2007)) are mostly restricted to approximating nearest neighbors given a distance metric to speed up large-scale lookup. To the best of our knowledge, there has been no previous work explicitly exploiting rank-based hash functions in a supervised hash learning setting.


3. Formulation 


3.1. Winner-Take-All Hashing 


WTA hashing is a subfamily of hashing functions introduced by (Yagnik et al., 2011). WTA is specified by two parameters: the number of random permutations $L$ and the window size $K$. Each permutation $\pi$ rearranges the entries of an input vector $x$ into $x_\pi$ in the order specified by $\pi$. Then the index of the maximum feature dimension among the first $K$ elements of $x_\pi$ is used as the hash code. This process is repeated $L$ times, resulting in a $K$-nary hash code of length $L$, which can be compactly represented using $L \times \lceil \log_2 K \rceil$ bits.
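The WTA procedure described above can be sketched in a few lines. This is a minimal illustration rather than the reference implementation: the function names `wta_hash` and `code_length_bits` and the toy sizes are assumptions made for this example, and the polynomial kernel extension used later in the experiments is not included.

```python
import math
import numpy as np

def wta_hash(x, permutations, K):
    """K-ary WTA code: for each permutation, the position of the largest
    value among the first K permuted entries of x."""
    x = np.asarray(x)
    return np.array([int(np.argmax(x[perm[:K]])) for perm in permutations])

def code_length_bits(L, K):
    """Number of bits needed to store L symbols, each taking K values."""
    return L * math.ceil(math.log2(K))

# Toy usage (hypothetical sizes): d = 8 features, L = 4 permutations, window K = 4.
rng = np.random.default_rng(0)
d, L, K = 8, 4, 4
perms = [rng.permutation(d) for _ in range(L)]
x = rng.normal(size=d)
codes = wta_hash(x, perms, K)
```

Because only the rank order inside each window matters, rescaling `x` by any positive constant leaves `codes` unchanged.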


WTA is considered a ranking-based hashing algorithm, since it uses the rank order among the permuted entries of a vector rather than the feature values themselves. This property gives WTA a certain degree of stability to perturbations in numeric values. Thus WTA hash codes usually induce a more robust metric structure for measuring the similarity between input vectors than other types of hash codes, which often contain inherent noise from quantizing the input feature space. Despite this theoretical soundness, however, the hash codes generated by WTA often must be sufficiently long to represent the original data with high fidelity.


This is caused by two limitations: (1) the entries of the input vectors are permuted in a random fashion before the comparison is applied to find the largest entry among the first $K$ ones; (2) the comparison and the ranking are restricted to the original features. Random permutations are very inefficient at finding the most discriminative entries for comparing the similarity between input vectors, and the restriction of the ranking to original features is too strong to generate compact representations. In what follows, we relax these two limitations.


3.2. Rank Subspace Hashing 

Rather than randomly permuting the input data vector $x$, we project it onto a set of $K$ one-dimensional subspaces. Then the input vector is encoded by the index of the subspace that generates the largest projected value. In other words, we have

$$h(x; W) = \arg\max_{1 \le k \le K} w_k^T x, \qquad (1)$$

where $w_k \in \mathbb{R}^d$, $1 \le k \le K$, are vectors specifying the subspace projections, and $W = [w_1, w_2, \cdots, w_K]^T$.

We use a linear projection to map an input vector into subspaces to form its hash code. At first glance, this idea is similar to the family of learning-based hashing algorithms based on linear projection (Datar et al., 2004; Norouzi & Fleet, 2011). However, unlike these existing algorithms, the proposed method ranks the obtained projections and encodes each input vector with the index of the dimension with the maximum value. This makes the obtained hash codes highly nonlinear in the input vector, invariant to the scaling of the vector, and insensitive to input noise to a larger degree than linear hashing codes. In this paper, we name this method Rank Subspace Hashing (RSH) to distinguish it from the other compared methods.


WTA is a special case of the RSH algorithm if we restrict the projections to $K$ axis-aligned linear subspaces, i.e., each $w_k$ is set to a column vector randomly chosen from an identity matrix $I$ of size $d \times d$.

RSH extends WTA by relaxing the axis-aligned linear subspaces in (1) to arbitrary $K$-dimensional linear subspaces in $\mathbb{R}^d$. Such relaxation greatly increases the flexibility to learn a set of subspaces that optimizes the hash codes resulting from the projections onto these subspaces.
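To make the relation between the two hash families concrete, the sketch below evaluates the RSH function of Eq. (1) and checks that choosing the rows of $W$ from the identity matrix reproduces one WTA symbol. The name `rsh_hash` and the random, unlearned `W` are assumptions for illustration only.

```python
import numpy as np

def rsh_hash(x, W):
    """Eq. (1): index of the largest projection w_k^T x, with W of shape (K, d)."""
    return int(np.argmax(W @ x))

rng = np.random.default_rng(1)
d, K = 6, 3
x = rng.normal(size=d)

# General RSH: arbitrary projection vectors (random here, learned in practice).
W = rng.normal(size=(K, d))
code = rsh_hash(x, W)

# WTA as a special case: the rows of W are the identity rows selected by the
# first K entries of a random permutation, so W_wta @ x equals x[perm[:K]].
perm = rng.permutation(d)
W_wta = np.eye(d)[perm[:K]]
wta_code = int(np.argmax(x[perm[:K]]))
```

With `W_wta`, `rsh_hash` returns the same symbol as the WTA window, and for any `W` the code is unchanged when `x` is multiplied by a positive constant.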

Now our objective boils down to learning hash functions characterized by the projections $W$ as in Eq. (1). Specifically, let $\mathcal{D}$ be the set of $N$ $d$-dimensional data points $\{x_i\}_{i=1}^{N}$ and let $S = \{s_{ij}\}_{1 \le i,j \le N}$ be the set of pairwise similarity labels satisfying $s_{ij} \in \{0, 1\}$, where $s_{ij} = 1$ means the pair $(x_i, x_j)$ is similar and $s_{ij} = 0$ means it is dissimilar. The pairwise similarity labels $S$ can be obtained either from the nearest neighbors in a metric space or from human annotation that denotes whether a pair of data points comes from the same class.


Given a similarity label $s_{ij}$ for each training pair, we can define the error incurred by a hash function like (1) as

$$e(h_i, h_j, s_{ij}) = \begin{cases} \rho\, I(h_i \ne h_j), & s_{ij} = 1 \\ \lambda\, (1 - I(h_i \ne h_j)), & s_{ij} = 0 \end{cases} \qquad (2)$$

where $I(\cdot)$ is the indicator function outputting 1 when the condition holds and 0 otherwise, $h_i$ is short for $h(x_i; W)$, and $\rho$ and $\lambda$ are two hyper-parameters that penalize false negatives and false positives respectively.

The learning objective is to find $W$ that minimizes the cumulative error function over the training set:

$$E(W) = \sum_{s_{ij} \in S} e(h_i, h_j, s_{ij}) \qquad (3)$$

Note that $W$ enters the above objective function because both $h_i$ and $h_j$ are functions of $W$.
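As a small illustration of the pairwise error (2) and the cumulative objective (3), the following sketch evaluates both on toy data. The dictionary layout for the labels and the names `pair_error` and `cumulative_error` are assumptions chosen for this example.

```python
import numpy as np

def pair_error(h_i, h_j, s_ij, rho=1.0, lam=1.0):
    """Eq. (2): rho for a similar pair with different codes,
    lam for a dissimilar pair with identical codes, 0 otherwise."""
    differ = h_i != h_j
    if s_ij == 1:
        return rho if differ else 0.0
    return 0.0 if differ else lam

def cumulative_error(X, S, W, rho=1.0, lam=1.0):
    """Eq. (3): sum of pair errors; S maps index pairs (i, j) to s_ij."""
    codes = np.argmax(X @ W.T, axis=1)  # RSH code of every row of X
    return sum(pair_error(codes[i], codes[j], s, rho, lam)
               for (i, j), s in S.items())

# Toy data: with W = I the codes are [0, 0, 1].
X = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0]])
W = np.eye(2)
S = {(0, 1): 1, (0, 2): 1, (1, 2): 0}
total = cumulative_error(X, S, W, rho=2.0, lam=3.0)
```

Here only the similar pair (0, 2) is split across codes, so `total` equals `rho`.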

3.3. Reformulation 

The above objective function is straightforward to formulate, but hard to optimize because it involves the indicator function and the arg max function, which are non-convex and highly discontinuous. Motivated by (Norouzi & Fleet, 2011), we reformulate the objective function and seek a piecewise linear upper bound of $E(W)$.

First, the hash function in (1) can be equivalently reformulated as

$$h(x; W) = \arg\max_{g}\; g^T W x, \quad \text{subject to } g \in \{0,1\}^K,\; \mathbf{1}^T g = 1, \qquad (4)$$

which outputs a 1-of-$K$ binary code $h$ for an input feature vector $x$. The constraint enforces that the resultant hash code has exactly one nonzero entry, which equals 1. We enforce this constraint in the following optimization problems without mentioning it explicitly, to avoid notational clutter. It is easy to see the equivalence to the hash function (1): the only nonzero element in $h$ encodes the index of the dimension with the maximum value in $Wx$.

Given a pairwise similarity label $s_{ij}$ between two vectors $x_i$ and $x_j$, let $h_i$ and $h_j$ be their hash codes obtained by solving the arg max problem (i.e. $h(x_i; W)$ and $h(x_j; W)$). Then the error function (2) can be upper bounded by

$$e(h_i, h_j, s_{ij}) \le \max_{g_i, g_j}\left[e(g_i, g_j, s_{ij}) + g_i^T W x_i + g_j^T W x_j\right] - h_i^T W x_i - h_j^T W x_j$$

This inequality is easy to prove by noting that

$$\max_{g_i, g_j}\left[e(g_i, g_j, s_{ij}) + g_i^T W x_i + g_j^T W x_j\right] \ge e(h_i, h_j, s_{ij}) + h_i^T W x_i + h_j^T W x_j$$

since $(h_i, h_j)$ is itself a feasible choice of $(g_i, g_j)$.

With the above upper bound of the error function, we seek to solve the minimax problem of minimizing the following function with respect to $W$:

$$\tilde{E}(W) = \sum_{s_{ij} \in S} \left\{ \max_{g_i, g_j}\left[e(g_i, g_j, s_{ij}) + g_i^T W x_i + g_j^T W x_j\right] - h_i^T W x_i - h_j^T W x_j \right\} \qquad (5)$$

3.4. Optimization

Consider first that $W$ is fixed. The first step is a discrete optimization problem that is guaranteed to have a globally optimal solution. Specifically, given the values of $Wx_i$ and $Wx_j$, the RSH codes $h_i$ and $h_j$ in the second and third terms of (5) can be found straightforwardly in $O(K)$. For the adjusted error $e(g_i, g_j, s_{ij}) + g_i^T W x_i + g_j^T W x_j$ of the first term in the square bracket, it is not hard to see that its maximum value can be obtained by scanning the elements of the matrix $[m_{kl}]_{K \times K}$, defined as

$$m_{kl} = \begin{cases} y_i^k + y_j^l + \lambda (1 - s_{ij}) & \text{if } k = l \\ y_i^k + y_j^l + \rho\, s_{ij} & \text{otherwise} \end{cases}$$

where $y_i^k$ is the $k$-th element of $Wx_i$. Assuming the $(k^*, l^*)$-th element of the above matrix achieves the maximum value, the maximizers $(g_i^*, g_j^*)$ of the adjusted error are 1-of-$K$ binary vectors with the $k^*$-th and the $l^*$-th dimension set to 1, respectively. The above procedure can be computed in $O(K^2)$. Since $K$ is normally very small (e.g. 2 to 8), the above discrete optimization problem can be computed efficiently.
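The scan over the $K \times K$ matrix $[m_{kl}]$ can be sketched as follows. This is an illustrative implementation under the definitions of Section 3.4; the function name `loss_augmented_inference` and the toy values are assumptions, and ties are broken by taking the first maximum in row-major order.

```python
import numpy as np

def loss_augmented_inference(y_i, y_j, s_ij, rho=1.0, lam=1.0):
    """Return (k*, l*, max value) of the adjusted error, where y_i = W x_i and
    y_j = W x_j. Off-diagonal entries add rho * s_ij, diagonal entries add
    lam * (1 - s_ij), following the definition of m_kl."""
    K = y_i.shape[0]
    M = y_i[:, None] + y_j[None, :] + rho * s_ij
    idx = np.arange(K)
    M[idx, idx] = y_i + y_j + lam * (1 - s_ij)
    k_star, l_star = np.unravel_index(np.argmax(M), M.shape)
    return int(k_star), int(l_star), float(M[k_star, l_star])

# Toy check with K = 2: both points project to [1, 0].
y = np.array([1.0, 0.0])
similar = loss_augmented_inference(y, y, s_ij=1, rho=3.0, lam=1.0)
dissimilar = loss_augmented_inference(y, y, s_ij=0, rho=3.0, lam=1.0)
```

For the similar pair the maximizer moves one point to a different code, since that configuration incurs the penalty `rho`; for the dissimilar pair the maximizer keeps both points on code 0, which incurs `lam`.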


Now consider the optimization of $W$. Fixing the maximizers $(g_i^*, g_j^*)$ of the first term, and the RSH codes $h_i$ and $h_j$ in (5), $W$ can be updated in the direction of the negative gradient of the objective in (5):

$$-\nabla_W \tilde{E}(W) = \sum_{i,j} (h_i - g_i^*) x_i^T + (h_j - g_j^*) x_j^T \qquad (6)$$

Batch updates can be made using (6) when the training data can be loaded into memory all at once. Otherwise, $W$ can also be updated in an online fashion with one training pair at a time, leading to the following iterative learning procedure

$$W \leftarrow W + \eta \left[ (h_i - g_i^*) x_i^T + (h_j - g_j^*) x_j^T \right], \qquad (7)$$

where $\eta$ is the learning rate.
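A single online step of (7) might look like the sketch below. It is an assumption-laden illustration rather than the authors' code: the helper names `one_hot` and `online_update`, the default hyper-parameter values, and the toy inputs are all hypothetical.

```python
import numpy as np

def one_hot(k, K):
    g = np.zeros(K)
    g[k] = 1.0
    return g

def online_update(W, x_i, x_j, s_ij, eta=0.1, rho=1.0, lam=1.0):
    """One step of Eq. (7) for a single training pair; W has shape (K, d)."""
    K = W.shape[0]
    y_i, y_j = W @ x_i, W @ x_j
    # Current RSH codes as 1-of-K vectors.
    h_i, h_j = one_hot(np.argmax(y_i), K), one_hot(np.argmax(y_j), K)
    # Loss-augmented maximizers from the K x K matrix of Section 3.4.
    M = y_i[:, None] + y_j[None, :] + rho * s_ij
    idx = np.arange(K)
    M[idx, idx] = y_i + y_j + lam * (1 - s_ij)
    k_star, l_star = np.unravel_index(np.argmax(M), M.shape)
    g_i, g_j = one_hot(k_star, K), one_hot(l_star, K)
    return W + eta * (np.outer(h_i - g_i, x_i) + np.outer(h_j - g_j, x_j))

# Toy example: a similar pair that already shares code 0, but with a small margin.
W0 = np.eye(2)
x_i, x_j = np.array([1.0, 0.5]), np.array([1.0, 0.8])
W1 = online_update(W0, x_i, x_j, s_ij=1, eta=0.1)
```

In this example the step widens the margin by which `x_j` selects code 0. Note that when the loss-augmented maximizers coincide with the current codes, the bracket in (7) vanishes and `W` is left unchanged.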

The learning algorithm is shown as Algorithm 1. The algorithm learns $L$ projection matrices by starting with different random initializations from a Gaussian distribution. Because of the convex-concave nature of the objective function, the solutions have multiple local minima. This is a desired property in our application, because each local minimum, corresponding to an RSH function, reflects a distinct perspective of ranked subspaces underlying the training examples. In addition, each hash function is learned independently and thus the learning can be done in parallel. The convergence of the learning algorithm has been explored and empirically studied in (McAllester et al., 2010; Norouzi & Fleet, 2011).

Algorithm 1 Rank Subspace Learning

Input: data $\{x_i\}$, pairwise similarity labels $\{s_{ij}\}$, length of hash code $L$, subspace dimension $K$

for $l = 1$ to $L$ do
    Initialize $w_k$, $1 \le k \le K$, from Gaussian distribution
    repeat
        Pick a pair $(x_i, x_j)$ and compute $h_i$, $h_j$, $g_i^*$, $g_j^*$
        Update projection matrix $W$ according to
            $W \leftarrow W + \eta[(h_i - g_i^*) x_i^T + (h_j - g_j^*) x_j^T]$
    until Convergence
end for

4. The Sequential Learning

In Algorithm 1, since each hash function is learned independently, the entire hash code may be suboptimal. This is because different random starting points may lead to the same local minimum, resulting in redundant hash bits. In order to maximize the information contained in an $L$-bit hash code, we propose to learn the hash functions sequentially so that each hash function provides complementary information to the previous ones.

In order to motivate our sequential learning algorithm, we can view each hash bit as a weak classifier that assigns similarity labels to an input pair, and the obtained ensemble classifier is related to the Hamming distance between hash codes. Formally, the weak classifier corresponding to the $l$-th bit is

$$sim_l(x_i, x_j) = 1 - H_m(h(x_i; W_l), h(x_j; W_l)) \qquad (8)$$

where $H_m(x, y) = I(x \ne y)$ is the bitwise Hamming distance, and $W_l$ is the projection matrix for this bit. Then, the Hamming distance between two $L$-bit hash codes can be seen as the vote of an ensemble of $L$ weak classifiers on them. Clearly, the sequential learning problem naturally fits into the AdaBoost framework.

The AdaBoost-based sequential learning algorithm is shown in Algorithm 2. In detail, a sampling weight $\alpha_{ij}^{(l)}$ is assigned to each training pair and is updated before training each new hash function. In particular, pairs that are misclassified by the current hash function will be given more weight in training the next hash function. The projection matrix is updated in a similar online fashion as in (7), but weighted by the sampling weight.

When all the hash functions have been trained, the voting results of the related weak classifiers are fused with a weighted combination

$$sim(x_i, x_j) = \sum_{l=1}^{L} \theta_l \left[ 1 - H_m(h(x_i; W_l), h(x_j; W_l)) \right], \qquad (9)$$

where $\theta_l$ is the weight of the $l$-th hash function, evaluated from its weighted training error.

We name this AdaBoost-inspired sequential learning Sequential RSH (SRSH), in contrast to the RSH algorithm with independently composed hash codes.

Algorithm 2 Sequential Rank Subspace Learning

Input: data $\{x_i\}$, pairwise similarity labels $\{s_{ij}\}$, length of hash code $L$, subspace dimension $K$
Initialize: set all the sampling weights $\{\alpha_{ij}^{(1)}\}$ to 1

for $l = 1$ to $L$ do
    Initialize $W_l$ from Gaussian distribution
    repeat
        Pick a pair $(x_i, x_j)$ and compute $h_i$, $h_j$, $g_i^*$, $g_j^*$ based on the current estimate of $W_l$;
        Update projection matrix $W_l$ according to
            $W_l \leftarrow W_l + \eta\, \alpha_{ij}^{(l)} [(h_i - g_i^*) x_i^T + (h_j - g_j^*) x_j^T]$;
    until Convergence
    Compute the weighted errors
        $\epsilon_l = \sum_{i,j} \alpha_{ij}^{(l)} e(h_i, h_j, s_{ij}) \,/\, \sum_{i,j} \alpha_{ij}^{(l)}$
    Evaluate the quantity
        $\theta_l = \ln \frac{1 - \epsilon_l}{\epsilon_l}$
    Update the pair weighting coefficients using
        $\alpha_{ij}^{(l+1)} = \alpha_{ij}^{(l)} \exp\{\theta_l\, e(h_i, h_j, s_{ij})\}$
    Normalize the sampling weights such that $\sum_{i,j} \alpha_{ij}^{(l+1)} = 1$
end for

5. Experiments

5.1. Dataset and Compared Methods 


In order to evaluate the proposed hashing approaches, Rank Subspace Hashing (RSH) and Sequential Rank Subspace Hashing (SRSH), we use three well-known datasets: LabelMe and Peekaboom, two collections of images represented as 512D Gist vectors designed for object recognition tasks; and MNIST, a corpus of handwritten digits in $28 \times 28$ greyscale images. The above datasets are assembled by (Kulis & Darrell, 2009) and also used in (Norouzi & Fleet, 2011).



















Rank Subspace Learning for Compact Hash Codes 



(a) LabelMe (b) MNIST (c) Peekaboom 

Figure 1. Average precision with varying hash code length. 



(a) LabelMe (b) MNIST (c) Peekaboom

Figure 2. Precision-recall curve when code length L = 32.


Following the settings of (Norouzi & Fleet, 2011), we randomly picked 1000 points for training and a separate set of 3000 points as test queries. The ground-truth neighbors for the test queries are defined by thresholding the Euclidean distance such that each query point has an average of 50 neighbors. Similarly, we define the neighbors and non-neighbors of each data point in the training set in order to create the similarity matrix. All the datasets are mean-centered and normalized prior to training and testing. Since some methods (e.g. SH) often perform better after dimensionality reduction, we apply PCA to all datasets and retain the top 40 directions for a fair comparison.
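The preprocessing and ground-truth construction described above could be sketched as follows. Several details are assumptions, since the text does not fix them: unit-norm scaling as the normalization, PCA computed by SVD, a single point set standing in for both queries and database, and the names `preprocess` and `neighbor_labels`.

```python
import numpy as np

def preprocess(X, n_components=40):
    """Mean-center, scale rows to unit norm (assumed normalization),
    then project onto the top principal directions via SVD."""
    Xc = X - X.mean(axis=0)
    Xn = Xc / np.maximum(np.linalg.norm(Xc, axis=1, keepdims=True), 1e-12)
    Xn = Xn - Xn.mean(axis=0)  # re-center before PCA
    _, _, Vt = np.linalg.svd(Xn, full_matrices=False)
    return Xn @ Vt[:n_components].T

def neighbor_labels(Z, avg_neighbors=50):
    """Boolean similarity matrix obtained by thresholding the Euclidean
    distance so that each point has avg_neighbors neighbors on average."""
    N = Z.shape[0]
    sq = (Z ** 2).sum(axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    iu = np.triu_indices(N, k=1)          # each unordered pair once
    dists = D2[iu]
    n_pairs = min(max(int(round(N * avg_neighbors / 2)), 1), dists.size)
    thr = np.sort(dists)[n_pairs - 1]     # squared-distance threshold
    S = np.zeros((N, N), dtype=bool)
    S[iu] = dists <= thr
    return S | S.T

# Toy usage on random data (hypothetical sizes, not the paper's datasets).
rng = np.random.default_rng(0)
Z = preprocess(rng.normal(size=(100, 64)), n_components=40)
S = neighbor_labels(Z, avg_neighbors=10)
```

Thresholding the squared distance is equivalent to thresholding the distance itself, and the resulting matrix can serve as the pairwise labels $s_{ij}$ for training.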

For comparison, we choose several state-of-the-art methods: Minimal Loss Hashing (MLH (Norouzi & Fleet, 2011)), Spectral Hashing (SH (Weiss et al., 2009)), Locality Sensitive Hashing (LSH (Datar et al., 2004)), and Winner-Take-All (WTA (Yagnik et al., 2011)) hash. For MLH and SH, we use the publicly available source code provided by their original authors, while we implemented our own versions of LSH and WTA since they are rather straightforward to implement. These methods cover both supervised (e.g., MLH) and unsupervised (e.g., SH) hashing as well as data-agnostic methods (e.g., LSH and WTA), and are considered most representative in their own categories.

5.2. Methodology 

In evaluating Approximate Nearest Neighbor (ANN) search, two methods are frequently adopted in the literature: hash table based lookup and Hamming distance based kNN search. We use both methods in our evaluation. In hash table lookup, the hash code is used to index all the points in a database, and the data points with the same hash key fall into the same bucket. Typically, hash buckets that fall within a Hamming ball of radius $R$ of the target query (e.g. the hash code differs by only 2 or 3 bits) are considered to contain relevant query results. A big advantage of hash table lookup is that it can be done in constant time. In contrast, Hamming distance based kNN search performs a standard kNN search procedure based on Hamming distance, which involves a linear scan of the entire database. However, since Hamming distance can be computed efficiently, kNN search in Hamming space is also very fast in practice.
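The two retrieval protocols can be illustrated with the sketch below. For brevity, the Hamming-ball lookup is emulated with a linear scan, which returns the same set as probing every bucket within radius $R$ but does not have the constant-time behavior of a real hash table; the function names are assumptions for this example.

```python
import numpy as np

def hamming_distances(query, database):
    """Number of code positions in which each database row differs from the query."""
    return np.count_nonzero(database != query, axis=1)

def hamming_ball_lookup(query, database, R):
    """Indices of all database codes within Hamming radius R of the query."""
    return np.flatnonzero(hamming_distances(query, database) <= R)

def hamming_knn(query, database, k):
    """Indices of the k database codes closest to the query in Hamming distance."""
    d = hamming_distances(query, database)
    return np.argsort(d, kind="stable")[:k]

# Toy database of four codes of length 4; distances to the query are [0, 1, 4, 2].
database = np.array([[0, 0, 0, 0],
                     [0, 0, 0, 1],
                     [1, 1, 1, 1],
                     [0, 1, 1, 0]])
query = np.array([0, 0, 0, 0])
in_ball = hamming_ball_lookup(query, database, R=2)
nearest = hamming_knn(query, database, k=2)
```

The same functions apply to $K$-ary codes such as those produced by WTA or RSH, since the comparison is position-wise.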

(a) LabelMe (b) MNIST (c) Peekaboom

Figure 3. Precision of retrieval within a Hamming ball of radius R = 2.

(a) LabelMe (b) MNIST (c) Peekaboom

Figure 4. Precision of retrieval within a Hamming ball of radius R = 3.

In our experiments, we evaluate the retrieval quality by setting $R = 2$ and $3$ in the hash table lookup and $k = 50$ and $100$ in Hamming distance based kNN search. For both evaluation protocols, we compute the retrieval precision, defined as the percentage of true neighbors among those returned by the query. The precision reflects the quality of hash codes to a large extent and can be critical for many applications. In addition, we also evaluate the average precision for hash table lookup by varying $R$, which approximates the area under the precision-recall curve. For all benchmarks (unless otherwise specified), we run every algorithm 10 independent times and report the mean and the standard deviation.
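A minimal sketch of the precision measure and of the aggregation over repeated runs is given below. Treating an empty retrieval as zero precision is an assumption of this example, as are the names `retrieval_precision` and `mean_and_std`; the text does not state how empty Hamming balls are scored.

```python
import numpy as np

def retrieval_precision(retrieved, true_neighbors):
    """Fraction of retrieved indices that are true neighbors.
    An empty retrieval is scored as 0.0 (assumption)."""
    retrieved = np.asarray(retrieved)
    if retrieved.size == 0:
        return 0.0
    return float(np.isin(retrieved, list(true_neighbors)).mean())

def mean_and_std(values):
    """Mean and standard deviation over independent runs."""
    v = np.asarray(values, dtype=float)
    return float(v.mean()), float(v.std())

# Toy usage: two of the three retrieved items are true neighbors.
p = retrieval_precision([0, 1, 3], {0, 3, 7})
m, s = mean_and_std([0.5, 0.7])
```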


As for parameter settings, MLH requires a loss scaling factor $\epsilon$ and two loss function hyper-parameters $\rho$ and $\lambda$. We follow the practice of MLH and perform cross-validation on a number of combinations to get the best performance at each code length. Similarly, our algorithm also has three hyper-parameters, that is, the subspace dimension $K$ and the two error term hyper-parameters defined in (2). A similar cross-validation procedure is used to find the best model. For WTA, we use the polynomial kernel extension and set the window size $K = 4$ and polynomial degree $p = 4$, as suggested by (Yagnik et al., 2011). SH and LSH are essentially parameter-free and therefore do not require special handling.


5.3. Results 


Figure 1 shows the average precision for different hash code lengths. We aim to compare the performance of different hashing methods in generating compact hash codes, and therefore the code length is restricted to below 64. It can be observed from Figure 1 that the average precision of almost all methods increases monotonically as the codes become longer, which is reasonable since longer codes retain more information about the original data. The only exception to this trend is SH, whose performance does not increase, or even slightly drops, beyond a certain number of bits (e.g. 24 to 32). This can be explained by the fact that unsupervised learning methods tend to overfit more easily with longer codes, which is consistent with the observation by (Wang et al., 2012).


We note that RSH shows significant improvement over WTA, another representative ranking-based hashing algorithm, as a result of the generalization of projection directions and the supervised learning process. Compared with RSH, SRSH further boosts the performance with large gains across all the tested datasets, demonstrating the effectiveness of the sequential learning method. In general, SRSH achieves the best performance, with about a 10% lead over MLH. We also note that both of our algorithms demonstrate exceptional performance with extremely short codes (e.g. of length less than 12) as a result of using rank order encoding.


In addition to the average precision, we also show a more detailed precision-recall profile in Figure 2, where the code length $L$ is fixed to 32. In the precision-recall curve, better performance is shown by a larger area under the curve. Again, both of our algorithms perform significantly better than WTA, with SRSH consistently being the best, which is consistent with the previous results.


The results of hash table lookups are shown in Fig. 3 and Fig. 4 for $R = 2$ and $R = 3$ respectively. As explained in the previous section, for many applications precision alone is more critical than average precision, which is an overall evaluation of both precision and recall. Therefore, the results in Figure 3 and Figure 4 can be more important for such applications. In those tests, rank order based techniques (i.e. WTA, RSH and SRSH) generally perform better than numeric value based hashing schemes because of a certain degree of resilience to numeric noise and perturbations. For example, although both WTA and LSH are based on data-agnostic random methods, WTA clearly outperforms LSH in most of the tests, which is similar to the results obtained in (Yagnik et al., 2011). However, we find that WTA sometimes fails to retrieve any neighbor within a small Hamming ball, resulting in a large standard deviation in precision at large code lengths (e.g. Fig. 3(b) and 4(b)). This is a natural result of applying randomness to a highly selective hash function. Such a limitation is effectively addressed by providing a certain amount of supervision in obtaining the hash functions. Therefore, both RSH and SRSH produce more stable results than WTA, as demonstrated by the consistently smaller standard deviations. Overall, SRSH performs the best in all the tests, again demonstrating its superiority in generating high-quality hash codes.


The last group of experiments is the Hamming distance based kNN search, where we evaluate the precision of true neighbors among the 50 and 100 nearest neighbors measured by Hamming distance. The results are similar to those in the hash table lookups, except that there are no missed retrievals for any of the compared algorithms because all queries are guaranteed to return the specified number of results. The proposed algorithms both give competitive results as compared with the others.


6. Conclusion

In this paper, we proposed a novel hash learning framework to optimize a number of low-dimensional linear subspaces for high-quality rank order-based hash encoding. A simple yet effective learning algorithm is then provided to optimize the objective function, leading to a number of optimal rank subspaces. The effectiveness of the proposed learning method in addressing the limitations of WTA is verified in a number of experiments. We also embed our learning method into a sequential learning framework that pushes the performance of the basic learning algorithm even further. Extensive experiments on several well-known datasets demonstrate our superior performance over the state of the art.